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Abstract. This paper describes a linear-time algorithm that finds the 
longest stretch in a sequence of real numbers (“scores”) in which the sum 
exceeds an input parameter. The algorithm also solves the problem of find¬ 
ing the longest interval in which the average of the scores is above a fixed 
threshold. The problem originates from molecular sequence analysis: for 
instance, the algorithm can be employed to identify long GC-rich regions in 
DNA sequences. The algorithm can also be used to trim low-quality ends 
of shotgun sequences in a preprocessing step of whole-genome assembly. 

1 Introduction 

Let ai,... ,a n be an arbitrary sequence of real numbers with n > 0. The 
segment [i,j\ for 1 < i < j < n is the interval {i,i + 1,... ,j}; its score is 
a(i,j) = at + dj+i + ■ ■ ■ + a-j. This paper’s central problem is the following. 
Given a score threshold a, find a segment [i, j] that has maximum length 
(j — i + 1) among those with a(i,j ) > a. 

Similar segmentation questions are encountered in statistical change- 
point estimation [7], with applications in various areas including molecular 
biology [1, 5, 11]. A number of related problems can be solved with effi¬ 
cient algorithms. Jon Bentley’s classic “programming pearl” finds a seg¬ 
ment with maximum score in O(n) time [3]. Csuros [5] solves the more 
general problem of finding a k- set of segments with maximum total score 
in 0(nmin{A:,logra}) time and O(n) space. Huang [10] reports a simple 
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linear-time algorithm for the dual of our problem, namely, that of finding a 
a segment that has maximum score among those longer than a given thresh¬ 
old. An algorithm of Lin et al. [12] finds such a segment in O(n) time, when 
in addition to a lower bound on the segment length, an upper bound is also 
imposed. 

In some situations, it may be interesting to evaluate a segment [i,j] by 
its average score a(i,j)/(j —i + 1). Lin et al. [12] devised an algorithm that 
finds the segment with maximum average score among those longer than L, 
in 0{n log L) time. Goldwasser et al. [8] give a faster algorithm for the same 
problem that runs in 0(n) time irrespective of L. This paper’s techniques 
lead to an 0(n)-time algorithm for the dual problem; namely, that of finding 
the longest segment with average score above a bound a. This latter result 
is particularly relevant in molecular sequence segmentation. For instance, 
our algorithm can be employed to identify the longest contiguous region 
in a DNA sequence with a GC-content (relative frequency of guanine and 
cytosine) above a cutoff level. The search for GC-rich and GC-poor regions 
in DNA is one of the main practical motivations behind the algorithms 
of [5, 8, 10, 12]. 

Whole-genome shotgun assembly programs also often need to compute 
long segments with high average scores. In shotgun sequencing, the se¬ 
quence of a long DNA molecule is computed from the sequences of ran¬ 
domly sampled short fragments [9], called the shotgun sequences. The shot¬ 
gun sequences are typically delivered together with position-specific error 
probabilities [6] to the assembly software. In a preprocessing phase, many 
assembly programs trim the shotgun sequences by removing the extremities 
with high sequencing error. It is important to trim the sequences only as 
much as is absolutely necessary. Small error levels can be tolerated and 
even corrected, while the assembly’s quality is ultimately determined by 
its length. The shotgun sequence trimming problem is defined as follows. 
Given a DNA sequence S 1 S 2 ■ • • s n and position-specific error probabilities 
ei, e 2 ,..., e n , find the longest contiguous substring Sj... Sj such that its 
average error (ej + ej+i + ... + ej)/(j — i + 1) falls below a user-specified 
threshold E. Clearly, by setting = l-e^, we can look for the longest seg¬ 
ment for which the average score is above (1 — E) by using the techniques in 
this paper. Existing assembly programs trim heuristically using variations of 
a sliding window technique, without guarantees of length optimality. (They 
rely on the fact that the error probabilities in chain-termination sequencing 
are usually high at the extremities and low in the middle, and essentially 
assume a unimodal function.) The assembly program Arachne [2], for in¬ 
stance, purposely looks for the longest segment with an average error below 
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a threshold, but closer inspection of the source code reveals that the imple¬ 
mented algorithm is not guaranteed to find an optimal segment for all error 
probabilities. 

2 Algorithm 

Define the prefix score fj = a(l ,j) for all j = 1 and let /o = 0. 

Obviously, a(i,j) = i, and thus we are looking for the longest segment 

[i.j] with fj > /j_i + a. Now, let 0 < i* < j* < n be such that fj* > fi* + a 
and (j* — i*) is maximal. Clearly, [i* + 1, j*\ is the longest segment with 
a(i* + 1 ,j*) > a. 

Lemma 1. Let 0 < i* < j* < n be such that fj* > fi* + a and ( j* — i*) is 
maximal. If i* > 0, then 


fi* < fo,fi,---,fi*-i- 


(la) 


If j* < n, then 

fj*>fnJn~l,---Jj*+l- (lb) 

Proof. We prove Eq. (la). For the sake of contradiction, assume that there 
exists such an i < i* that fi < fi*. Then j* — i > j* — ?'*, yet fj* > fi* + a > 
fi + a. Eq. (lb) is proven analogously. □ 

Definition 1 . Define the left sequence of minima 0 = l\ < I2 < ■ ■ ■ Ik < n by 
l\ = 0 and lj = min{i: Zj_ 1 < i < n, fi < fi _ 1 } ■ Define the right sequence 

of maxima n = rq > r2 > • • • > r m > 0 by r\ = n and rj = maxjf: 0 < i < 

r j—hfi P frj~ j}- 

Figure 1 illustrates these notions. By definition, the left sequence of 
minima is sorted in decreasing order of prefix scores: 


fh > fh > ’ ’ ’ > fh ■ 


(2a) 


Similarly, 


/ri < /r 2 < • • • < fr m ■ (2b) 

By Lemma 1, we can restrict our attention to segments \i,j\ where i € 
{li... , Ik} and j € {?q,..., r m }. Equations (2a) and (2b) imply the following 
lemmas. 
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/decreases 





^ /increases 


Figure 1: Left sequence of minima {lfi\ and right sequence of maxima {rj}. 

Lemma 2. Let i € {0,..., n}. If fi + a < f rj for some j, then fi + a < f r i . 
for all j' > j. 

Lemma 3. Let j £ {0,..., n}. If f^ + a < fj, then fi + a < fj for all 
i ' > i. 

In view of these lemmas, we can define the following values. 

Definition 2. For all i = 1,..., k define right(z) by right(z) = min{j : f r . > 
fh + a}- Let right(z) = m + 1 if f k + a > f rj for all j. 

For all j = 1,..., m, define left(j) by I eft (j) = min{i: fi t + a < f rj }; let 
left (j) = k + 1 if f h + a > f rj for all i. 

By Lemmas 1, 2 and 3, for every i = 1 the longest segment 

[li + 1, j] which scores above a has j = r right (j), unless right(z) = m + 1 or 
r right(i) ^ h, hi which case there is no suitable segment with left endpoint 
at k + 1. Similarly, left(j) gives the left endpoint i = l\ e ft(j) f° r the longest 
segment [i + 1, rfi that scores above the threshold, unless left(j) = k + 1 or 
/lef t (j) > rj , in which case there is no suitable segment with right endpoint rj. 
Lemmas 2 and 3 imply the following property. 

Lemma 4. For all i < i < a < k, right(z) > right(i / ). For all m > j > j 1 > 
1, left(j) < left(j'). 

Now, the best segment is the longest valid segment in the set 

{\li “b ljf right (i) ] ■ l !>•••) hi { left (j) F f 5 ] . j 1, . . . , Tlf l} . 

In fact, it suffices to consider only one of the two sets since for the longest 
segment [/,* + 1 ,rj*], left(j*) = i* and right(z*) = j*. 

The following algorithm solves the original problem. 

1 Algorithm LongestSegment 
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2 Input: scores a*: i = 1, — ,n; threshold a 

3 Output: longest segment that scores above a , or nil if no segment 
score exceeds a 
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Set 

/o<- 

-0; 

for i 1, 

• • •, n do fi <- fi_ 

1 + 

5 

Set 

k «- 

lb 

!i <-0 
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for 

i <— 

1,- 

.., n do if 

fi < fi k then k <- 

■ k 4~ 1, <— i 

7 

Set 

m e- 

-1, 

r\ n 
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for 

j 

n, 

..., 1 do if 

fj > fr m then rn 

m + l,r m j 

9 

Set 

max 

<- 

0, segment • 

- nil 



10 Set i <— l,j <— m 

11 while i < k and j > 1 do 

12 while i < k and fi z + a > f rj do i i + 1 

13 if i < k then 

14 while j > 1 and fi t + a < f Tj do 

15 if rj — U > max then max <— rj — Zj, segment [l t + 1, rj] 

16 Set j *— j — 1 

17 end while 

18 end if 

19 end while 

20 return segment 

Theorem 1. Algorithm LongestSegment finds the longest segment which 
scores above the threshold a in 0(n ) time. 

Proof. Line 4 calculates the prefix sums /* in 0(n) time. Lines 5-6 compute 
the left sequence of minima, and Lines 7-8 compute the right sequence of 
maxima, in O(n) time. Lines 11 19 are executed in 0{k + m) = 0(n ) 
time, since the loops of Lines 12 and 16 increase i and decrease j by one, 
respectively. 

Line 9 initializes the structures for tracking the best segment: max stores 
the length of the longest segment found and segment is the best segment. 
In Lines 10-19, the algorithm goes through pairs (i, j) where i = left(j). 
More precisely, the algorithm’s correctness follows from the invariant that 
i = k + 1 or i = left(j) holds in Line 13. Subsequently, as long as the while 
loop’s condition in Line 14 is true, i = left(j) holds. As discussed, one of the 
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segments [Z| e ft(j) + 1 , r j] is the longest one that scores above the cutoff, and, 
thus Line 15 finds the optimal segment if the invariant is true. In order to 
see that the invariant is true, notice the following. First, after the condition 
of the loop in Line 12 fails with j = m, the invariant holds by Definition 2 
of left(m). Secondly, for j < m , left(j) can be looked for starting the search 
at left (j + 1) by Lemma 4, and, thus the invariant holds every time the 
execution arrives to Line 13. □ 


3 Related problems 

The same technique applies also to the problem of finding a segment with 
maximum score with a lower bound on the segment length. (Albeit Xiaoqiu 
Huang’s algorithm [10] is arguably simpler.) The idea is to define the left 
and right pairs by thresholding on the segment length and then select the 
one segment with the highest score. 

The described algorithm can also be used to find the longest segment 
with an average score above a given threshold (3. Since a *+ a »+i+^--+ a j > ^ if 

and only if Ylk=i( a k~P) — 0’the longest such segment can be found by using 
Algorithm LongestSegment with scores (a* — (3) and threshold a = 0. 
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Remark. Kuan-Yu Chen and Kun-Mao Chao’s paper [4] about the same 
problem came out in print while this paper was under review. (So this will 
always remain a preprint.) Their algorithm works in an on-line setting. 
They also show the reduction for finding the shortest segment which scores 
above a given cutoff. 
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